Introduction

This application would allow the City of Chicago to predict food inspection failures the year before inspectors are allocated to restaurants and other food vendors. The City’s Health Department would use this application to prepare their plans for the distribution of health inspectors across the city.

With over 19,000 inspections taking place in Chicago in 2019 alone, the sheer volume of this task necessitates prediction-aided planning. Restaurants failing inspection are legally required to receive an additional visit by the Chicago Health Department, so it is especially important for the city to be aware of locations where inspection failures might be common.

Exploratory Analysis

After reading in the Chicago Health Inspection data, total inspections for the year 2018 and a map of their density are displayed below. There are tens of thousands of inspections for 2018, just under 20% of which were failures.

Preliminary Maps

# Create a map containing boundary of Chicago.
chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 

# Create a fishnet containing 500 by 500 foot squares over the boundary of Chicago.
fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # <- MDH Added
  st_sf() %>%
  mutate(uniqueID = rownames(.))

## Neighborhoods to use 
neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet))

Food <-
  read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") 

FoodInspect <- Food %>%
    mutate(year = substr(inspection_date,1,4)) %>% 
    filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Food_Inspect")
# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = FoodInspect, colour="red", size=0.1, show.legend = "point") +
  labs(title= "Food Inspections, Chicago - 2018") +
  mapTheme(title_size = 14),

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey80") +
  stat_density2d(data = data.frame(st_coordinates(FoodInspect)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Food Inspections") +
  mapTheme(title_size = 14) + theme(legend.position = "none"))

All failed inspections for the year 2018 and a map of their density are displayed below. It makes sense that these maps closely resemble the previous ones, since failed inspections tend to show up where there are more inspections to begin with. Regional patterns do emerge in the comparison between total and failed inspections, however: inspections in the visible clusters to the north and west of the Loop are more likely to fail.

FoodInspect <- Food %>%
    mutate(year = substr(inspection_date,1,4)) %>% 
    filter(year == "2018") %>%
    filter(results=="Fail") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Food_Inspect")

grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = FoodInspect, colour="red", size=0.1, show.legend = "point") +
  labs(title= "Failed Food Inspections, Chicago - 2018") +
  mapTheme(title_size = 14),

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey80") +
  stat_density2d(data = data.frame(st_coordinates(FoodInspect)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Failed Food Inspections") +
  mapTheme(title_size = 14) + theme(legend.position = "none"))

Fishnet

The model will predict the locations of future failed inspections on a fishnet overlaid on Chicago. Three distinct clusters are immediately visible: the Loop, plus clusters to the west and north. Note that this map does not show whether an area has a higher rate of inspection failures. The sum of failures (rather than the rate) is used because the sum corresponds most directly to the resources the Chicago Health Department must expend.
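Conceptually, binning points into 500 x 500 ft cells is just integer division on each coordinate axis. A base-R toy sketch of the idea (hypothetical coordinates in feet, not the real inspection data):

```r
# Hypothetical point coordinates in feet (toy data, not the real inspections)
pts <- data.frame(x = c(120, 480, 620, 700), y = c(50, 450, 80, 90))

# A point's cell is found by integer division of each coordinate by 500
cell_id <- paste(pts$x %/% 500, pts$y %/% 500, sep = "-")

# Counting points per cell mirrors what aggregate() does over the fishnet:
# 2 points land in cell "0-0" and 2 in cell "1-0"
table(cell_id)
```

The real code below does the same thing with `st_make_grid()` geometry and `aggregate()`, which also handles irregular city boundaries.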

## Add a value of 1 to each failed inspection, then sum them by cell with aggregate
inspect_net <- 
  dplyr::select(FoodInspect) %>% 
  mutate(countInspect = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countInspect = replace_na(countInspect, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

# Plot the fishnet displaying the count of inspection failures
ggplot() +
  geom_sf(data = inspect_net, aes(fill = countInspect), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of Inspection Failures for the Fishnet") +
  mapTheme()

Pulling Risk Factors

The following risk factors are considered to be potential predictors of Failed Food Inspections:

  1. Rodent Complaints (311 service requests for rodent baiting)
  2. Sanitation (311 service requests due to sanitation code complaints)
  3. Ordinance Violation (Department of Buildings)
  4. Graffiti (311 service requests for removal)
  5. Liquor Retail (liquor store locations)
  6. Previous Year Food Inspection Failures
  7. Prediction Year Food Inspection All Locations*

_*When making predictions, this data will come from the year whose potential failures are being predicted. It is assumed that this location-based information will be available, since these addresses must be known for inspections to take place in the first place._

## using Socrata again
rodentBait <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-Historical/97t6-zrhs") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Rodent_Bait")

sanitation311 <-
  read.socrata(paste0("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac")) %>%
  mutate(year = substr(creation_date,1,4)) %>%
  filter(year == "2018") %>%
  filter(what_is_the_nature_of_this_code_violation_ %in%
  c("Garbage in Yard","Garbage in Alley","Dumpster not being emptied", "Overflowing carts", "Construction Site Cleanliness/Fence", "Standing water")) %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Sanitation_311")


ordinanceViolation <- 
  read.socrata("https://data.cityofchicago.org/Administration-Finance/Ordinance-Violations-Buildings-/awqx-tuwv") %>%
    mutate(year = substr(VIOLATION.DATE,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Ordinance_Violations")


graffiti <-
  read.socrata(paste0("https://data.cityofchicago.org/Service",
  "-Requests/311-Service-Requests-Graffiti-Removal-Historical/",
  "hec5-y4x5")) %>%
  mutate(year = substr(creation_date,1,4)) %>%
  filter(year == "2017") %>%
  filter(where_is_the_graffiti_located_ %in%
  c("Front","Rear","Side")) %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Graffiti")


liquorRetail <-
  read.socrata(paste0("https://data.cityofchicago.org/resource/",
  "nrmj-3kcf.json")) %>%
  filter(business_activity ==
  "Retail Sales of Packaged Liquor") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Liquor_Retail")


foodInspectFail <-
  read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") %>%
    mutate(year = substr(inspection_date,1,4)) %>% filter(year == "2017") %>%
    filter(results=="Fail") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Food_Inspect_Fail")

foodInspectLocations <-
  read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") %>%
    mutate(year = substr(inspection_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Food_Inspect_Locations")

The multi-map below shows each risk factor mapped individually on the fishnet:

vars_net <- rbind(rodentBait, sanitation311, ordinanceViolation, graffiti, liquorRetail, foodInspectFail, foodInspectLocations) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet, by = "uniqueID") %>%
  spread(Legend, count, fill=0) %>%
  st_sf() %>%
  dplyr::select(-`<NA>`) %>%
  na.omit() %>%
  ungroup()


vars_net.long <-
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList1 <- list()

for(i in vars){
  mapList1[[i]] <-
  ggplot() +
  geom_sf(data = filter(vars_net.long, Variable == i),
  aes(fill=value), colour=NA) +
  scale_fill_viridis(name="") +
  labs(title=i) +
  mapTheme()}
  
do.call(grid.arrange,c(mapList1, ncol=3,
top="Risk Factors by Fishnet"))

Modeling Spatial Features

The next chunk of code converts risk factors to nearest neighbor distances, which lets us treat risk factors in a spatially “smoothed” manner. For every risk factor we pick k = 3, meaning each cell gets the average distance to the nearest 3 occurrences of the event. The nearest neighbor plots follow:
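The `nn_function` helper comes from the book's companion script. A minimal base-R sketch of what it computes (the mean Euclidean distance from each query point to its k nearest event points; `avg_knn_dist` is a hypothetical name for illustration):

```r
# Sketch of the nn_function idea: for each row of `from`, the mean
# Euclidean distance to its k nearest rows of `to` (coordinate matrices)
avg_knn_dist <- function(from, to, k = 3) {
  apply(from, 1, function(p) {
    d <- sqrt((to[, 1] - p[1])^2 + (to[, 2] - p[2])^2)
    mean(sort(d)[seq_len(k)])
  })
}

# Toy check: events at distances 1, 2, 3 and 10 from the origin;
# the 3-NN average from the origin is the mean of 1, 2 and 3, i.e. 2
from <- matrix(c(0, 0), ncol = 2)
to   <- cbind(c(1, 2, 3, 10), c(0, 0, 0, 0))
avg_knn_dist(from, to, k = 3)
```

The real helper is vectorized for speed, but the statistic per cell is the same: an average distance that rises smoothly as events get sparser.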

# convenience aliases to shorten the function calls below
st_c    <- st_coordinates
st_coid <- st_centroid

## create NN from rodent bait
vars_net <- vars_net %>%
    mutate(rodent_Bait.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(rodentBait),
                                           k = 3))

## create NN from sanitation 311
vars_net <- vars_net %>%
    mutate(sanitation_311.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(sanitation311),
                                           k = 3))

## create NN from ordinance violations
vars_net <- vars_net %>%
    mutate(ordinance_Violation.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(ordinanceViolation),
                                           k = 3))

## create NN from graffiti
vars_net <- vars_net %>%
    mutate(graffiti.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(graffiti),
                                           k = 3))

## create NN from liquorRetail
vars_net <- vars_net %>%
    mutate(liquor_Retail.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(liquorRetail),
                                           k = 3))

## create NN from food_Inspect_Fail
vars_net <- vars_net %>%
    mutate(food_Inspect_Fail.nn = nn_function(st_c(st_coid(vars_net)),
                                           st_c(foodInspectFail),
                                           k = 3))

## create NN from food_Inspect_Locations
vars_net <- vars_net %>%
    mutate(food_Inspect_Locations.nn = nn_function(st_c(st_coid(vars_net)),
                                           st_c(foodInspectLocations),
                                           k = 3))



vars_net.long.nn <-
  dplyr::select(vars_net, ends_with(".nn")) %>%
  gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)

mapList2 <- list()

for(i in vars){
  mapList2[[i]] <-
  ggplot() +
  geom_sf(data = filter(vars_net.long.nn, Variable == i),
  aes(fill=value), colour=NA) +
  scale_fill_viridis(name="") +
  labs(title=i) +
mapTheme()}

do.call(grid.arrange,c(mapList2, ncol = 3,
top = "Nearest Neighbor risk Factors by Fishnet"))

# Distance to the Loop, to be used later
loopPoint <-
  filter(neighborhoods, name == "Loop") %>%
  st_centroid()

vars_net$loopDistance <-
  st_distance(st_centroid(vars_net), loopPoint) %>%
  as.numeric()

The plot below combines the 3-NN versions of risk factors into a single map:

# Visualize the NN feature
vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

# Display plot
ggplot() +
      geom_sf(data = vars_net.long.nn, aes(fill=value), colour=NA) +
      scale_fill_viridis(name="3-NN Distance") +
      labs(title="Risk Factors 3-NN Distance") +
      mapTheme()

Since the counts were aggregated to each cell by uniqueID, we can use that ID to join the counts to the fishnet.
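The join itself is a plain attribute join on that shared key. In base-R terms (toy frames standing in for the fishnet and the risk-factor counts):

```r
# Toy attribute join on a shared uniqueID key: cells without a matching
# count row get NA (a left join, like dplyr::left_join with all.x = TRUE)
cells  <- data.frame(uniqueID = c("1", "2", "3"))
counts <- data.frame(uniqueID = c("1", "3"), rodents = c(4, 7))
merge(cells, counts, by = "uniqueID", all.x = TRUE)
```

In the real pipeline the geometry is dropped from `vars_net` first so the join stays an attribute join rather than an expensive spatial one.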

## important to drop the geometry from joining features
final_net <-
  left_join(inspect_net, st_drop_geometry(vars_net), by="uniqueID") 

Join in areal data

We use a spatial join to attach each fishnet cell’s centroid to its neighborhood polygon. The map below shows the resulting fishnet:

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

# for live demo
mapview::mapview(final_net, zcol = "name")
## generates warnings from PROJ issues
## {spdep} builds queen-contiguity neighbors from the fishnet polygons... 
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and converts them to a list of spatial weights
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

# print(final_net.weights, zero.policy=TRUE)

Plotting local Moran’s I results

The grid map below shows the following four elements:

  1. Count of failed inspections in Chicago
  2. Local Moran’s I: dispersal or clustering of failed inspections relative to neighboring cells
  3. P-value: statistical significance of the local statistic
  4. Significant Hotspots: major clusters of failed inspections in Chicago (p <= 0.001)

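For intuition, a cell’s local Moran’s I is its deviation from the citywide mean multiplied by the weighted average deviation of its neighbors. A hand-rolled toy version of one standard formulation (the analysis itself uses spdep::localmoran, which adds the inference):

```r
# Hand-rolled local Moran's I on a toy example (one common formulation;
# spdep::localmoran is what the analysis actually uses)
local_moran_i <- function(x, W) {
  z  <- x - mean(x)                 # deviations from the mean
  m2 <- sum(z^2) / length(x)        # variance with an n divisor
  as.numeric((z / m2) * (W %*% z))  # deviation times the spatial lag
}

# Three cells in a line with row-standardized contiguity weights
W <- rbind(c(0,   1, 0),
           c(0.5, 0, 0.5),
           c(0,   1, 0))

# Positive for cell 1 (high value next to a high neighbor, i.e. clustering);
# negative for cell 3 (low value next to a higher neighbor)
local_moran_i(c(5, 4, 1), W)
```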
## see ?localmoran
local_morans <- localmoran(final_net$countInspect, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(countInspect = countInspect, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)


## This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList2 <- list()

for(i in vars){
  varList2[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme(title_size = 14) + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList2, ncol = 4, top = "Local Moran's I statistics, Failed Inspections"))

Distance to Hot spot

Once again, we’ll use the nearest neighbor function to find the distance to a hot spot location.

# generates warning from NN
final_net <- final_net %>% 
  mutate(Inspect.isSig = 
           ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
  mutate(Inspect.isSig.dist = 
           nn_function(st_c(st_coid(final_net)),
                       st_c(st_coid(filter(final_net, 
                                           Inspect.isSig == 1))), 
                       k = 1))
ggplot() +
      geom_sf(data = final_net, aes(fill=Inspect.isSig.dist), colour=NA) +
      scale_fill_viridis(name="Inspect.isSig.dist") +
      labs(title="NN Distance to Highly Significant \nFailed Inspect Clusters") +
      mapTheme()

We’ll likewise consider the distance to the Loop, Chicago’s main business district and center. See the map below.

ggplot() +
    geom_sf(data = vars_net, aes(fill=loopDistance), colour=NA) +
    scale_fill_viridis(name="loopDistance") +
    labs(title="Distance to the Loop") +
    mapTheme()

Correlations

The plots below show linear correlations between various features of risk factors and failed inspections in Chicago.

ADDITIONAL ANALYSIS GOES HERE

correlation.long <-
  st_drop_geometry(final_net) %>%
  dplyr::select(-uniqueID, -cvID, -name, -loopDistance, -Inspect.isSig, -Inspect.isSig.dist) %>%
  gather(Variable, Value, -countInspect)


correlation.cor <-
  correlation.long %>%
  group_by(Variable) %>%
  summarize(correlation = cor(Value, countInspect, use = "complete.obs"))

ggplot(correlation.long, aes(Value, countInspect)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor,
  aes(label = paste("r =", round(correlation, 2))),
  x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Failed inspection count as a function of risk factors") +
plotTheme()

Histogram

The histogram below shows the frequency distribution of failed inspection counts, summed by cross-validation fold.

# sum failed inspection counts by cross-validation fold (cvID)
HistogramInspect <-
  final_net %>%
    group_by(cvID) %>%
    summarize(countInspect_sum = sum(countInspect, na.rm = T),
              countInspect_mean = mean(countInspect, na.rm = T)) %>%
  ungroup()

# error_by_reg_and_fold %>%
#   arrange(desc(MAE))
# error_by_reg_and_fold %>%
#   arrange(MAE)

## plot histogram of failed inspections
HistogramInspect %>%
  ggplot(aes(countInspect_sum)) +
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    scale_x_continuous(breaks = seq(30, 475, by = 25)) +
    labs(title="Distribution of Failed Inspection Counts", subtitle = "Chicago, IL",
         x="Failed Inspection Count", y="Count")

Modeling

The model below uses all seven risk factors as nearest neighbor features, along with spatial features marking where failed inspections cluster significantly, as independent variables. The count of failed inspections is the dependent variable. Under spatial cross-validation, the model holds out each neighborhood in turn, trains on the remaining cells, and generates predictions for the hold-out.

We generate four variations of the model: risk factors with and without spatial features, each under both random k-fold and spatial cross-validation.
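Inside each fold the workhorse is a Poisson regression of counts on the predictors. A toy illustration with simulated data (not the real fishnet; the variable names here are made up):

```r
# Simulate counts that decay with a hypothetical nearest-neighbor distance,
# then recover the relationship with a Poisson glm, as each CV fold does
set.seed(1)
toy <- data.frame(nn_dist = runif(200, 0, 5))
toy$count <- rpois(200, lambda = exp(1 - 0.4 * toy$nn_dist))

fit <- glm(count ~ nn_dist, family = "poisson", data = toy)
coef(fit)  # the slope estimate should land near the true value of -0.4

# Predictions come back on the count scale with type = "response"
predict(fit, data.frame(nn_dist = 1), type = "response")
```

A Poisson model suits the dependent variable here because failure counts per cell are non-negative integers with many zeros, which an ordinary least squares model handles poorly.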

# The book's crossValidate() hardcodes "countBurglaries"; redefine it here
# with a generic dependent variable as a workaround
remove(crossValidate)
crossValidate <- function(dataset, id, dependentVariable, indVariables) {

 allPredictions <- data.frame()
 cvID_list <- unique(dataset[[id]])

 for (i in cvID_list) {
 
  thisFold <- i
  #cat("This hold out fold is", thisFold, "\n")
 
  fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
   dplyr::select(id, geometry, indVariables, dependentVariable)
  fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
   dplyr::select(id, geometry, indVariables, dependentVariable)
 
  regression <-
    glm(paste0(dependentVariable,"~."), family = "poisson",
     data = fold.train %>%
      dplyr::select(-geometry, -id))
  thisPrediction <-
   mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
 
  allPredictions <-
   rbind(allPredictions, thisPrediction)
 
 }
 return(st_sf(allPredictions))
}



# View(crossValidate)

## define independent variables: risk factors plus spatial features
reg.ss.vars <- c("rodent_Bait.nn", "sanitation_311.nn","ordinance_Violation.nn", "graffiti.nn", "liquor_Retail.nn", "food_Inspect_Fail.nn", "food_Inspect_Locations.nn", "Inspect.isSig", "Inspect.isSig.dist", "loopDistance")

## define independent variables: just risk factors
reg.vars <-
c("rodent_Bait.nn", "sanitation_311.nn","ordinance_Violation.nn", "graffiti.nn", "liquor_Retail.nn", "food_Inspect_Fail.nn", "food_Inspect_Locations.nn")

## RUN REGRESSIONS
reg.ss.spatialCV <- crossValidate(
  dataset = dplyr::rename(final_net, countFailures = countInspect),
  id = "name",
  dependentVariable = "countFailures",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countFailures, Prediction, geometry)


reg.cv <- crossValidate(
  dataset = dplyr::rename(final_net, countFailures = countInspect),
  id = "cvID",
  dependentVariable = "countFailures",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countFailures, Prediction, geometry)


reg.ss.cv <- crossValidate(
  dataset = dplyr::rename(final_net, countFailures = countInspect),
  id = "cvID",
  dependentVariable = "countFailures",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = cvID, countFailures, Prediction, geometry)


reg.spatialCV <- crossValidate(
  dataset = dplyr::rename(final_net, countFailures = countInspect),
  id = "name",
  dependentVariable = "countFailures",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = name, countFailures, Prediction, geometry)

Cross-validation

Model errors are plotted in the histograms below. Mean absolute errors cluster at low values, but the presence of neighborhoods with substantial errors suggests the model could benefit from further refinement.

Most grievous errors occur in neighborhoods near the Loop and the…

Comparing mean errors between models shows that adding the spatial features improves our models noticeably compared to using the risk factors alone. The comparison map confirms this finding.

reg.summary <-rbind(
  mutate(reg.cv,
  Error = Prediction - countFailures,
  Regression = "Random k-fold CV: Just Risk Factors"),

  mutate(reg.ss.cv,
  Error = Prediction - countFailures,
  Regression = "Random k-fold CV: Spatial Process"),

  mutate(reg.spatialCV,
  Error = Prediction - countFailures,
  Regression = "Spatial LOGO-CV: Just Risk Factors"),

  mutate(reg.ss.spatialCV,
  Error = Prediction - countFailures,
  Regression = "Spatial LOGO-CV: Spatial Process")) %>%
st_sf()


# calculate errors by NEIGHBORHOOD
error_by_reg_and_fold <-
  reg.summary %>%
  group_by(Regression, cvID) %>%
    summarize(Mean_Error = mean(Prediction - countFailures, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>%
  arrange(desc(MAE))
## Simple feature collection with 396 features and 5 fields
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 341164.1 ymin: 552875.4 xmax: 367164.1 ymax: 594875.4
## projected CRS:  NAD_1983_HARN_StatePlane_Illinois_East_FIPS_1201
## # A tibble: 396 x 6
##    Regression     cvID   Mean_Error   MAE SD_MAE                        geometry
##    <chr>          <chr>       <dbl> <dbl>  <dbl>                  <GEOMETRY [m]>
##  1 Spatial LOGO-… Ander…      -6.30  6.30   6.30 POLYGON ((355164.1 589875.4, 3…
##  2 Spatial LOGO-… Loop        -6.29  6.29   6.29 POLYGON ((358164.1 578375.4, 3…
##  3 Spatial LOGO-… Wrigl…      -5.82  5.82   5.82 POLYGON ((356164.1 586875.4, 3…
##  4 Spatial LOGO-… Mille…      -5.42  5.42   5.42 POLYGON ((359164.1 578875.4, 3…
##  5 Spatial LOGO-… Ander…      -5.40  5.40   5.40 POLYGON ((355164.1 589875.4, 3…
##  6 Spatial LOGO-… China…      -4.46  4.46   4.46 POLYGON ((357664.1 575375.4, 3…
##  7 Spatial LOGO-… Stree…      -4.41  4.41   4.41 MULTIPOLYGON (((359164.1 57987…
##  8 Spatial LOGO-… Rush …       4.18  4.18   4.18 POLYGON ((358664.1 581375.4, 3…
##  9 Spatial LOGO-… Mille…      -3.82  3.82   3.82 POLYGON ((359164.1 578875.4, 3…
## 10 Spatial LOGO-… Boyst…      -3.75  3.75   3.75 POLYGON ((356664.1 585875.4, 3…
## # … with 386 more rows
# error_by_reg_and_fold %>%
#   arrange(MAE)

## plot histogram of OOF (out of fold) errors
error_by_reg_and_fold %>%
  ggplot(aes(MAE)) +
  geom_histogram(bins = 30, colour="black", fill="#FDE725FF") +
  facet_wrap(~Regression) +
  geom_vline(xintercept = 0) + scale_x_continuous(
  breaks = seq(0, 8, by = 1)) +
  labs(title="Distribution of MAE",
  subtitle = "k-fold cross-validation vs. LOGO-CV",
  x="Mean Absolute Error", y="Count") +
plotTheme()

## Errors by Model
st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>%
  summarize(Mean_MAE = round(mean(MAE), 2),
  SD_MAE = round(sd(MAE), 2)) %>%
  kable() %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(2, color = "black", background = "#FDE725FF") %>%
  row_spec(4, color = "black", background = "#FDE725FF")
Regression                            Mean_MAE  SD_MAE
Random k-fold CV: Just Risk Factors       0.38    0.34
Random k-fold CV: Spatial Process         0.36    0.33
Spatial LOGO-CV: Just Risk Factors        0.86    1.37
Spatial LOGO-CV: Spatial Process          0.66    0.98
## map of errors by neighborhood
error_by_reg_and_fold %>%
  filter(str_detect(Regression, "LOGO")) %>%
  ggplot() +
  geom_sf(aes(fill = MAE)) +
  facet_wrap(~Regression) +
  scale_fill_viridis() +
  labs(title = "Failed inspection errors by LOGO-CV Regression") +
mapTheme() + theme(legend.position="bottom")

## neighborhood weights

neighborhood.weights <-
  filter(error_by_reg_and_fold,
  Regression == "Spatial LOGO-CV: Spatial Process") %>%
  group_by(cvID) %>%
  poly2nb(as_Spatial(.), queen=TRUE) %>%
  nb2listw(., style="W", zero.policy=TRUE)

filter(error_by_reg_and_fold, str_detect(Regression, "LOGO")) %>%
  st_drop_geometry() %>%
  group_by(Regression) %>%
  summarize(Morans_I =
  moran.mc(abs(Mean_Error), neighborhood.weights,
    nsim = 999, zero.policy = TRUE,
      na.action=na.omit)[[1]],
    p_value =
      moran.mc(abs(Mean_Error), neighborhood.weights,
    nsim = 999, zero.policy = TRUE,
    na.action=na.omit)[[3]])
## # A tibble: 2 x 3
##   Regression                         Morans_I p_value
## * <chr>                                 <dbl>   <dbl>
## 1 Spatial LOGO-CV: Just Risk Factors    0.213   0.006
## 2 Spatial LOGO-CV: Spatial Process      0.255   0.001

Density vs Predictions

Next we’ll use kernel density with varying search radii: 1000, 1500 and 2000 feet. The maps below visualize each of the three radii.
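The bandwidth (search radius) tradeoff is easy to see in one dimension with base R’s density(). This toy example with simulated data shows that a wider bandwidth flattens and smooths the estimate, just as a larger search radius does on the maps:

```r
# Two clusters of simulated values: a narrow bandwidth keeps the two
# peaks distinct, a wide one smears them into one flat hump
set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 6))

d_narrow <- density(x, bw = 0.3)  # cf. the 1000 ft search radius
d_wide   <- density(x, bw = 3)    # cf. the 2000 ft search radius

# The wide-bandwidth estimate has a much lower maximum: mass is spread out
max(d_narrow$y) > max(d_wide$y)
```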

# demo of kernel width
inspect_ppp <- as.ppp(st_coordinates(FoodInspect), W = st_bbox(final_net))
inspect_KD.1000 <- spatstat.core::density.ppp(inspect_ppp, 1000)
inspect_KD.1500 <- spatstat.core::density.ppp(inspect_ppp, 1500)
inspect_KD.2000 <- spatstat.core::density.ppp(inspect_ppp, 2000)
inspect_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(inspect_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(inspect_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(inspect_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft."))

inspect_KD.df$Legend <- factor(inspect_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))

ggplot(data=inspect_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) +
  facet_wrap(~Legend) +
  coord_sf(crs=st_crs(final_net)) +
  scale_fill_viridis(name="Density") +
  labs(title = "Kernel density with 3 different search radii") +
  mapTheme(title_size = 14)

Kernel density can be viewed on the fishnet, with a random sample of 1,500 failed inspection points overlaid.

as.data.frame(inspect_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
   ggplot() +
     geom_sf(aes(fill=value)) +
     geom_sf(data = sample_n(FoodInspect, 1500), size = .5) +
     scale_fill_viridis(name = "Density") +
     labs(title = "Kernel density of 2018 failed inspections") +
     mapTheme(title_size = 14)

Get 2019 inspection data

To evaluate the generalizability of our model, we’ll apply our predictions and compare the results to 2019 inspection data.

FoodInspect19 <-
  read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") %>%
    mutate(year = substr(inspection_date,1,4)) %>% filter(year == "2019") %>%
    # filter(results=="Fail") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Food_Inspect")

Generalizability across Race Contexts

ACS data allows us to evaluate generalizability by race contexts.

…Do we even want to include this?

One idea would be to run the model for 2019, using the 2018 variables except for the inspection location data, which would come from 2019 (since presumably there is information as to where restaurants are that need inspecting).

This means running the glm() function outside of the cross-validation method, but I think it might be worth it. Apologies for not getting to that over the weekend!